(54) Pipeline protection

(57) A processing engine including a processor pipeline 820 with a number of pipeline stages 822-828, a number of resources and a pipeline protection mechanism 838. The pipeline protection mechanism includes, for each protected resource, respective arbitration logic 886 for anticipating access conflicts for that resource between the pipeline stages. An output of each arbitration logic is connected 888, 889 to form stall control signals for controlling the selective stalling of the pipeline to avoid the resource access conflicts. The resources could, for example, be registers or parts (fields) within registers. By providing arbitration logic for each resource, an embodiment of the invention effectively enables a distribution of the control logic needed to anticipate potential resource access conflicts, and allows selective stalling of the pipeline to prevent the conflicts from actually occurring.




EP 0 992 896 A1

Description 

BACKGROUND OF THE INVENTION 

[0001] The present invention relates to pipeline processor design, more especially to protecting a processor pipeline against conflicts.
[0002] Typically, modern processing engines, such as are found in digital signal processors (DSPs) or microprocessors, employ a pipelined architecture in order to improve processing performance. A pipelined architecture means that various stages of instruction processing are performed sequentially such that more than one instruction will be at different stages of processing within the pipeline at any one time.
[0003] Although a pipelined architecture does allow higher processing speed than would be possible if the processing of one instruction were to be completed before the processing of another could be started, this does lead to significant complications regarding potential conflicts in operation. Conflicts may occur between resource accesses, for example in a situation where a second instruction attempts to access a register or a part of a register before a first instruction has finished operations on that register, whereby the second instruction might receive invalid data.
[0004] Such potential conflicts are often termed "data hazards". Examples of possible data hazards include:

- read after write (e.g. ARx = ARy followed by *ARx = k16);
- write after read (e.g. ARx = ARy followed by mar(ARy=P16));
- write after write (e.g. ARx = ARy followed by mar(ARx=P16)).
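The three hazard classes can be stated as set intersections between the registers that an earlier and a later instruction read and write. The following Python sketch is illustrative only: it models register effects alone (the memory write of *ARx = k16 is ignored), and mapping ARx and ARy to AR1 and AR2 is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Registers read and written by one instruction (memory effects ignored)."""
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def classify_hazards(first, second):
    """Hazard types that `second` has with respect to the earlier `first`."""
    hazards = set()
    if first.writes & second.reads:
        hazards.add("RAW")  # second reads what first writes
    if first.reads & second.writes:
        hazards.add("WAR")  # second writes what first reads
    if first.writes & second.writes:
        hazards.add("WAW")  # both write the same register
    return hazards


# ARx -> AR1, ARy -> AR2 (arbitrary choice for illustration).
mov = Access(reads=frozenset({"AR2"}), writes=frozenset({"AR1"}))  # ARx = ARy
store = Access(reads=frozenset({"AR1"}))   # *ARx = k16 (AR1 used as the address)
mar_y = Access(writes=frozenset({"AR2"}))  # mar(ARy=P16)
mar_x = Access(writes=frozenset({"AR1"}))  # mar(ARx=P16)

print(classify_hazards(mov, store))  # {'RAW'}
print(classify_hazards(mov, mar_y))  # {'WAR'}
print(classify_hazards(mov, mar_x))  # {'WAW'}
```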











[0005] Various techniques for hardware pipeline 
protection are known in the art. 
[0006] One example is termed "scoreboarding". With scoreboarding, each register or field can have pending writes and reads qualified with their execution phase using a table, or scoreboard. However, such an approach can be complex to handle and expensive in terms of logic overhead and, as a consequence, in power consumption. Particularly in processing engines designed for portable applications or applications powered other than by the mains (e.g., battery or other alternatively powered applications), such an approach is undesirable. Moreover, a scoreboarding approach rapidly becomes unwieldy when the processing engine has a large instruction set and/or a parallel processing architecture.
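To make the cost argument concrete, the following hypothetical Python sketch models a scoreboard as counters of pending reads and writes per resource and per execution phase. It is not drawn from any specific prior-art design; the phase names are borrowed from the pipeline described later, and the point is only that the table grows as resources x phases x 2.

```python
# Phases at which accesses are recorded (names borrowed from the pipeline
# described below; using exactly these four is an assumption).
PHASES = ("ADR", "ACC", "RD", "EXE")


class Scoreboard:
    """Toy scoreboard: pending-read and pending-write counters per
    (resource, phase) pair."""

    def __init__(self, resources):
        self.reads = {r: dict.fromkeys(PHASES, 0) for r in resources}
        self.writes = {r: dict.fromkeys(PHASES, 0) for r in resources}

    def issue(self, reads=(), writes=()):
        """Record an issued instruction's accesses as (resource, phase) pairs."""
        for res, phase in reads:
            self.reads[res][phase] += 1
        for res, phase in writes:
            self.writes[res][phase] += 1

    def complete(self, reads=(), writes=()):
        """Clear entries once the corresponding accesses have been performed."""
        for res, phase in reads:
            self.reads[res][phase] -= 1
        for res, phase in writes:
            self.writes[res][phase] -= 1

    def write_pending(self, resource):
        return any(self.writes[resource].values())

    def table_entries(self):
        """Number of counters the table needs: resources x phases x 2."""
        return 2 * len(self.reads) * len(PHASES)


sb = Scoreboard(["AR0", "AR1"])
sb.issue(reads=[("AR0", "EXE")], writes=[("AR1", "EXE")])  # e.g. AR1 = AR0 + K16
print(sb.write_pending("AR1"), sb.table_entries())  # True 16
```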

[0007] Other approaches can employ read/write queuing. However, such an approach is unsuitable where there is a wide variety of pipeline fields and/or sources of resource accesses. Moreover, such an approach can also rapidly become complex to handle and expensive in terms of logic overhead and power consumption.

[0008] A further approach attaches a resource encoding to instructions within the pipeline. However, such an approach can also suffer from disadvantages similar to those described above.
[0009] There is, therefore, a need for a different approach to resource conflict management within a pipeline for avoiding data hazards, which does not suffer from the disadvantages of the prior approaches described above.

SUMMARY OF THE INVENTION 

[0010] Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Combinations of features from the dependent claims may be combined with features of the independent claims as appropriate and not merely as explicitly set out in the claims.

[0011] In accordance with an aspect of the invention, there is provided a processing engine including a processor pipeline with a plurality of pipeline stages, a plurality of resources and a pipeline protection mechanism. The pipeline protection mechanism includes, for each protected resource, respective arbitration logic for anticipating access conflicts for that resource between the pipeline stages. An output of each arbitration logic is connected to form stall control signals for controlling the selective stalling of the pipeline to avoid the resource access conflicts.

[0012] The resources could, for example, be registers or parts (e.g. fields) of registers.
[0013] By providing arbitration logic for each resource, an embodiment of the invention effectively enables a distribution of the control logic needed to anticipate potential resource access conflicts, and allows selective stalling of the pipeline to prevent the conflicts from actually occurring. With this distributed, or modular, approach the overall logic can be kept relatively simple and easy to manage. Also, surprisingly, this can result in a reduction in the total logic needed. Consequently, less area, or so-called real estate, within an integrated circuit will be taken up by the pipeline protection mechanism than would be the case with the prior approaches described above. Moreover, as a result of the reduction in the amount of logic needed, the power consumption can be reduced, while still providing effective pipeline protection.
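As a minimal sketch of this distributed arrangement, assume each stage reports, per resource, whether its instruction accesses the resource in the current cycle or still has an access pending. One arbitration function per resource then raises stall requests, and an OR stage merges them. The names `arbitrate`, `merge_stalls` and `stages` and the flag encoding are hypothetical; this is not the gate-level logic of the embodiment.

```python
N_STAGES = 7  # P0..P6 of the described pipeline


def stages(*active):
    """Helper: a per-stage list of flags with the given stage indices set."""
    return [s in active for s in range(N_STAGES)]


def arbitrate(read_now, write_now, read_pending, write_pending):
    """Arbitration logic for ONE protected resource.

    read_now[s] / write_now[s]: the instruction in stage s accesses the
    resource in this cycle. read_pending[t] / write_pending[t]: the
    instruction in stage t still has an access to the resource that has
    not completed. Higher stage numbers hold older instructions.
    Returns one stall request flag per stage."""
    stall = [False] * N_STAGES
    for s in range(N_STAGES):
        older = range(s + 1, N_STAGES)
        raw = read_now[s] and any(write_pending[t] for t in older)
        war = write_now[s] and any(read_pending[t] for t in older)
        waw = write_now[s] and any(write_pending[t] for t in older)
        stall[s] = raw or war or waw
    return stall


def merge_stalls(requests):
    """Output merge: OR the per-resource stall requests stage by stage."""
    return [any(req[s] for req in requests.values()) for s in range(N_STAGES)]


# Figure 8A-like case: the younger instruction reads AR1 in ADR (P3) while
# an older one in EXE (P6) has not yet written AR1; AR3 is idle.
requests = {
    "AR1": arbitrate(stages(3), stages(), stages(), stages(6)),
    "AR3": arbitrate(stages(), stages(), stages(), stages()),
}
print(merge_stalls(requests))  # [False, False, False, True, False, False, False]
```

In a real pipeline, a stall requested for one stage also freezes the younger stages behind it; the sketch only reports where each request originates.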

[0014] Preferably, the arbitration logic for each of the resources is derived from a generic arbitration function determined for the pipeline. The generic function may itself be embodied in the integrated circuit as generic arbitration logic capable of handling simultaneous occurrence of all envisaged conflicts. Each of the arbitration logic blocks may fully embody the generic arbitration function, but will typically only embody different special forms of the generic arbitration function. The generic arbitration function provides a logical definition of all of the potential, or theoretical, conflicts which could occur between respective pipeline stages. In practice, it may not be physically possible for all of the theoretical conflicts to occur for each of the resources, since the resources concerned may not be accessible at all of the pipeline stages being monitored. However, configuring the respective arbitration logic blocks from a single, generic function simplifies the design of the logic for the individual resources, and provides consistent performance and testability.
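One way to picture the relationship between the generic arbitration function and its special forms is as a list of conflict terms that is pruned per resource. In this hypothetical Python sketch a term is (hazard kind, stage of the younger access, stage of the older pending access); the access stages for a pointer register (read at ADR and RD, written at ADR and EXE) are assumptions loosely based on the examples of Figures 8A to 10B.

```python
from itertools import product

# Stage numbers of the phases at which accesses can occur (P3-P6 of the
# described pipeline); which of them apply to a resource is assumed.
STAGES = {"ADR": 3, "ACC": 4, "RD": 5, "EXE": 6}
KINDS = ("RAW", "WAR", "WAW")


def generic_terms():
    """Every theoretical conflict term (kind, younger_stage, older_stage).

    A younger instruction accessing the resource at stage s can only clash
    with an older instruction whose access is still pending at a later
    stage t > s."""
    stage_nums = sorted(STAGES.values())
    return [(k, s, t) for k in KINDS
            for s, t in product(stage_nums, repeat=2) if s < t]


def specialize(read_stages, write_stages):
    """Keep only the generic terms that can physically occur for a resource
    read at `read_stages` and written at `write_stages`."""
    r = {STAGES[n] for n in read_stages}
    w = {STAGES[n] for n in write_stages}
    # (stages of the younger access, stages of the older access) per kind
    allowed = {"RAW": (r, w), "WAR": (w, r), "WAW": (w, w)}
    return [(k, s, t) for k, s, t in generic_terms()
            if s in allowed[k][0] and t in allowed[k][1]]


pointer_reg = specialize(read_stages={"ADR", "RD"}, write_stages={"ADR", "EXE"})
print(len(generic_terms()), len(pointer_reg))  # 18 4
```

With these assumptions, the 18 generic terms reduce to 4 for the pointer register, which is the kind of saving the per-resource specialization relies on.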

[0015] The processing engine will typically include pipeline control logic for controlling the stages of the pipeline. This pipeline control logic can be connected to receive the stall control signals derived, or output, from the arbitration logic. Output merge logic can be provided for merging the output of each arbitration logic to form stall control signals for controlling the selective stalling of the pipeline to avoid the resource access conflicts.
[0016] The pipeline protection mechanism may comprise an access decoder stage connected to receive access information from at least selected pipeline stages to derive access information for respective protected resources. The arbitration logic for a protected resource can then be connected to receive access information for that protected resource from the access decoder stage. In this manner, the arbitration logic for each protected resource can receive the information it needs to perform a conflict check for that resource.
[0017] The decoder stage may include a plurality of access decoders, each access decoder being associated with a respective pipeline stage. Input merge logic can be provided for each protected resource to merge the access information for that resource from the various access decoders.
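The decoder and input merge stages can be sketched as follows, under the assumption (not stated in the text) that each stage holds two instruction slots, as suggested by the two instruction decoders described later, and that only pending writes are decoded. Resource names and function names are hypothetical.

```python
# Hypothetical set of protected resources.
RESOURCES = ("AR0", "AR1", "AR2", "AR3")


def access_decoder(pending_writes):
    """Access decoder for one instruction slot of one pipeline stage:
    turns the register names it will write into one flag per protected
    resource (registers that are not protected produce no flag)."""
    return {r: r in pending_writes for r in RESOURCES}


def input_merge(decoded_by_stage, resource):
    """Input merge for one resource: OR the flags of all decoders of each
    stage, giving the per-stage vector that resource's arbiter consumes."""
    return [any(d[resource] for d in decoders) for decoders in decoded_by_stage]


# Two instruction slots per stage, stages in pipeline order ADR, ACC, RD, EXE.
decoded = [
    [access_decoder(set()), access_decoder(set())],      # ADR
    [access_decoder({"AR2"}), access_decoder(set())],    # ACC
    [access_decoder(set()), access_decoder({"AR1"})],    # RD
    [access_decoder({"AR1"}), access_decoder({"AR3"})],  # EXE
]
print(input_merge(decoded, "AR1"))  # [False, False, True, True]
```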

[0018] The access information can relate to pending accesses. It can also relate to current accesses. Indeed, a current access decoder stage can be connected to receive current access information from the pipeline to derive current access information for respective protected resources, the arbitration logic for a protected resource being connected to receive current access information for that protected resource as well as pending access information.
[0019] In an embodiment of the invention, the current access decoder stage is a decoder stage for a register file, whereby the logic for the register file is reused for the pipeline protection mechanism, thus providing a saving in the logic required for the processing engine.
[0020] Separate input merge logic can be provided 
for each protected resource and be connected to the 
arbitration logic for that resource. 
[0021] The processing engine can be in the form of a digital signal processor. Alternatively, it could be in the form of a microprocessor, or any other form of processing engine employing a pipelined architecture. The processing engine can be implemented in the form of an integrated circuit.

[0022] A particular application for a processing engine in accordance with the present invention is in the form of a wireless telecommunications device, in particular a portable telecommunications device such as, for example, a mobile telephone, where low power consumption and high processing performance are required.
[0023] In accordance with another aspect of the invention, there is provided a method of protecting a pipeline in a processing engine, which processing engine includes a processor pipeline with a plurality of pipeline stages and a plurality of resources. The method comprises, for respective protected resources, separately arbitrating for the resource to anticipate access conflicts between the pipeline stages, and selectively stalling the pipeline depending upon the result of the arbitration for the respective resources to avoid resource access conflicts.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0024] Particular embodiments in accordance with the invention will now be described, by way of example only, and with reference to the accompanying drawings in which like reference signs are used to denote like parts, unless otherwise stated, and in which:

Figure 1 is a schematic block diagram of a processor in accordance with an embodiment of the invention;

Figure 2 is a schematic diagram of a core of the processor of Figure 1;

Figure 3 is a more detailed schematic block diagram of various execution units of the core of the processor of Figure 1;

Figure 4 is a schematic diagram of an instruction buffer queue and an instruction decoder controller of the processor of Figure 1;

Figure 5 is a representation of pipeline phases of the processor of Figure 1;

Figure 6 is a diagrammatic illustration of an example of operation of a pipeline in the processor of Figure 1;

Figure 7 is a schematic representation of the core of the processor for explaining the operation of the pipeline of the processor of Figure 1;

Figure 8A is an example of a read after write hazard;

Figure 8B is another example of a read after write hazard;

Figure 9 is an example of a write after write hazard;

Figure 10A is one example of a write after read hazard;

Figure 10B is another example of a write after read hazard;

Figure 11 illustrates possible conflicts at various pipeline stages;

Figure 12 illustrates a process for deriving the definition of a generic arbitration function;

Figure 13 is a schematic diagram of generic arbitration logic;

Figure 14 illustrates the architecture of a dual-pipeline arithmetic logic unit of a processing engine incorporating an embodiment of the invention;

Figure 15 is a schematic block diagram of an example of pipeline protection logic in accordance with the invention;

Figure 16 is an integrated circuit incorporating the processor of Figure 1; and

Figure 17 is an example of mobile telecommunications apparatus incorporating the processor of Figure 1.

DESCRIPTION OF PARTICULAR EMBODIMENTS 

[0025] Although the invention finds particular application to Digital Signal Processors (DSPs), implemented for example in an Application Specific Integrated Circuit (ASIC), it also finds application to other forms of processing engines.
[0026] The basic architecture of an example of a 
processor according to the invention will now be 
described. 

[0027] Figure 1 is a schematic overview of a processor 10 forming an exemplary embodiment of the present invention. The processor 10 includes a processing engine 100 and a processor backplane 20. In the present embodiment, the processor is a Digital Signal Processor 10 implemented in an Application Specific Integrated Circuit (ASIC).
[0028] As shown in Figure 1, the processing engine 100 forms a central processing unit (CPU) with a processing core 102 and a memory interface, or management, unit 104 for interfacing the processing core 102 with memory units external to the processor core 102.

[0029] The processor backplane 20 comprises a backplane bus 22, to which the memory management unit 104 of the processing engine is connected. Also connected to the backplane bus 22 is an instruction cache memory 24, peripheral devices 26 and an external interface 28.

[0030] It will be appreciated that in other embodiments, the invention could be implemented using different configurations and/or different technologies. For example, the processing engine 100 could form the processor 10, with the processor backplane 20 being separate therefrom. The processing engine 100 could, for example, be a DSP separate from and mounted on a backplane 20 supporting a backplane bus 22, peripheral and external interfaces. The processing engine 100 could, for example, be a microprocessor rather than a DSP and could be implemented in technologies other than ASIC technology. The processing engine, or a processor including the processing engine, could be implemented in one or more integrated circuits.



[0031] Figure 2 illustrates the basic structure of an embodiment of the processing core 102. As illustrated, the processing core 102 includes four elements, namely an Instruction Buffer Unit (I Unit) 106 and three execution units. The execution units are a Program Flow Unit (P Unit) 108, an Address Data Flow Unit (A Unit) 110 and a Data Computation Unit (D Unit) 112 for executing instructions decoded from the Instruction Buffer Unit (I Unit) 106 and for controlling and monitoring program flow.

[0032] Figure 3 illustrates the P Unit 108, A Unit 110 and D Unit 112 of the processing core 102 in more detail and shows the bus structure connecting the various elements of the processing core 102. The P Unit 108 includes, for example, loop control circuitry, GoTo/Branch control circuitry and various registers for controlling and monitoring program flow such as repeat counter registers and interrupt mask, flag or vector registers. The P Unit 108 is coupled to general purpose Data Write busses (EB, FB) 130, 132, Data Read busses (CB, DB) 134, 136 and a coefficient program bus (BB) 138. Additionally, the P Unit 108 is coupled to sub-units within the A Unit 110 and D Unit 112 via various busses labeled CSR, ACB and RGD.

[0033] As illustrated in Figure 3, in the present embodiment the A Unit 110 includes a register file 30, a data address generation sub-unit (DAGEN) 32 and an Arithmetic and Logic Unit (ALU) 34. The A Unit register file 30 includes various registers, among which are 16-bit pointer registers (AR0-AR7) and data registers (DR0-DR3) which may also be used for data flow as well as address generation. Additionally, the register file includes 16-bit circular buffer registers and 7-bit data page registers. As well as the general purpose busses (EB, FB, CB, DB) 130, 132, 134, 136, a coefficient data bus 140 and a coefficient address bus 142 are coupled to the A Unit register file 30. The A Unit register file 30 is coupled to the A Unit DAGEN unit 32 by unidirectional busses 144 and 146 respectively operating in opposite directions. The DAGEN unit 32 includes 16-bit X/Y registers and coefficient and stack pointer registers, for example for controlling and monitoring address generation within the processing engine 100.
[0034] The A Unit 110 also comprises the ALU 34 which includes a shifter function as well as the functions typically associated with an ALU such as addition, subtraction, and AND, OR and XOR logical operators. The ALU 34 is also coupled to the general-purpose busses (EB, DB) 130, 136 and an instruction constant data bus (KDB) 140. The A Unit ALU is coupled to the P Unit 108 by a PDA bus for receiving register content from the P Unit 108 register file. The ALU 34 is also coupled to the A Unit register file 30 by busses RGA and RGB for receiving address and data register contents and by a bus RGD for forwarding address and data registers in the register file 30.

[0035] As illustrated, the D Unit 112 includes a D Unit register file 36, a D Unit ALU 38, a D Unit shifter 40 and two multiply and accumulate units (MAC1, MAC2) 42 and 44. The D Unit register file 36, D Unit ALU 38 and D Unit shifter 40 are coupled to busses (EB, FB, CB, DB and KDB) 130, 132, 134, 136 and 140, and the MAC units 42 and 44 are coupled to the busses (CB, DB, KDB) 134, 136, 140 and data read bus (BB) 144. The D Unit register file 36 includes 40-bit accumulators (AC0-AC3) and a 16-bit transition register. The D Unit 112 can also utilize the 16-bit pointer and data registers in the A Unit 110 as source or destination registers in addition to the 40-bit accumulators. The D Unit register file 36 receives data from the D Unit ALU 38 and MACs 1&2 42, 44 over accumulator write busses (ACW0, ACW1) 146, 148, and from the D Unit shifter 40 over accumulator write bus (ACW1) 148. Data is read from the D Unit register file accumulators to the D Unit ALU 38, D Unit shifter 40 and MACs 1&2 42, 44 over accumulator read busses (ACR0, ACR1) 150, 152. The D Unit ALU 38 and D Unit shifter 40 are also coupled to sub-units of the A Unit 110 via various busses labeled EFC, DRB, DR2 and ACB.

[0036] Referring now to Figure 4, there is illustrated an instruction buffer unit 106 comprising a 32-word instruction buffer queue (IBQ) 502. The IBQ 502 comprises 32x16-bit registers 504, logically divided into 8-bit bytes 506. Instructions arrive at the IBQ 502 via the 32-bit program bus (PB) 122. The instructions are fetched in a 32-bit cycle into the location pointed to by the Local Write Program Counter (LWPC) 532. The LWPC 532 is contained in a register located in the P Unit 108. The P Unit 108 also includes the Local Read Program Counter (LRPC) 536 register, and the Write Program Counter (WPC) 530 and Read Program Counter (RPC) 534 registers. The LRPC 536 points to the location in the IBQ 502 of the next instruction or instructions to be loaded into the instruction decoder(s) 512 and 514. That is to say, the LRPC 536 points to the location in the IBQ 502 of the instruction currently being dispatched to the decoders 512, 514. The WPC points to the address in program memory of the start of the next 4 bytes of instruction code for the pipeline. For each fetch into the IBQ, the next 4 bytes from the program memory are fetched regardless of instruction boundaries. The RPC 534 points to the address in program memory of the instruction currently being dispatched to the decoder(s) 512 and 514.

[0037] The instructions are formed into a 48-bit word and are loaded into the instruction decoders 512, 514 over a 48-bit bus 516 via multiplexors 520 and 521. It will be apparent to a person of ordinary skill in the art that the instructions may be formed into words comprising other than 48 bits, and that the present invention is not limited to the specific embodiment described above.
[0038] The bus 516 can load a maximum of two instructions, one per decoder, during any one instruction cycle. The combination of instructions may be in any combination of formats, 8, 16, 24, 32, 40 and 48 bits, which will fit across the 48-bit bus. Decoder 1, 512, is loaded in preference to decoder 2, 514, if only one instruction can be loaded during a cycle. The respective instructions are then forwarded on to the respective function units in order to execute them and to access the data for which the instruction or operation is to be performed. Prior to being passed to the instruction decoders, the instructions are aligned on byte boundaries. The alignment is done based on the format derived for the previous instruction during decoding thereof. The multiplexing associated with the alignment of instructions with byte boundaries is performed in multiplexors 520 and 521.
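The dispatch rule above can be approximated in Python as a greedy grouping of instruction lengths. This is a simplified, hypothetical sketch: it assumes any two consecutive instructions whose combined size fits in 48 bits are loaded together, whereas in practice pairing would also depend on decode constraints not described here.

```python
BUS_BYTES = 6                      # 48-bit bus 516
FORMAT_BYTES = {1, 2, 3, 4, 5, 6}  # 8- to 48-bit instruction formats


def dispatch(lengths):
    """Group a stream of instruction lengths (in bytes) into per-cycle loads
    of bus 516. Each tuple is (decoder 1 [, decoder 2]). Simplification:
    any two consecutive instructions that fit together are paired."""
    if any(n not in FORMAT_BYTES for n in lengths):
        raise ValueError("unsupported instruction format")
    cycles, i = [], 0
    while i < len(lengths):
        if i + 1 < len(lengths) and lengths[i] + lengths[i + 1] <= BUS_BYTES:
            cycles.append((lengths[i], lengths[i + 1]))  # both decoders loaded
            i += 2
        else:
            cycles.append((lengths[i],))  # decoder 1 only
            i += 1
    return cycles


print(dispatch([2, 4, 3, 3, 6, 1]))  # [(2, 4), (3, 3), (6,), (1,)]
```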

[0039] The processor core 102 executes instructions through a 7-stage pipeline, the respective stages of which will now be described with reference to Figure 5.

[0040] The first stage of the pipeline is a PRE-FETCH (P0) stage 202, during which stage a next program memory location is addressed by asserting an address on the address bus (PAB) 118 of a memory interface, or memory management unit 104.
[0041] In the next stage, FETCH (P1) stage 204, the program memory is read and the I Unit 106 is filled via the PB bus 122 from the memory management unit 104.

[0042] The PRE-FETCH and FETCH stages are separate from the rest of the pipeline stages in that the pipeline can be interrupted during the PRE-FETCH and FETCH stages to break the sequential program flow and point to other instructions in the program memory, for example for a Branch instruction.
[0043] The next instruction in the instruction buffer is then dispatched to the decoder/s 512/514 in the third stage, DECODE (P2) 206, where the instruction is decoded and dispatched to the execution unit for executing that instruction, for example to the P Unit 108, the A Unit 110 or the D Unit 112. The decode stage 206 includes decoding at least part of an instruction including a first part indicating the class of the instruction, a second part indicating the format of the instruction and a third part indicating an addressing mode for the instruction.

[0044] The next stage is an ADDRESS (P3) stage 208, in which the address of the data to be used in the instruction is computed, or a new program address is computed should the instruction require a program branch or jump. Respective computations take place in the A Unit 110 or the P Unit 108 respectively.
[0045] In an ACCESS (P4) stage 210, the address of a read operand is output and the memory operand, the address of which has been generated in a DAGEN X operator with an Xmem indirect addressing mode, is then READ from indirectly addressed X memory (Xmem).

[0046] The next stage of the pipeline is the READ (P5) stage 212 in which a memory operand, the address of which has been generated in a DAGEN Y operator with a Ymem indirect addressing mode or in a DAGEN C operator with coefficient address mode, is READ. The address of the memory location to which the result of the instruction is to be written is output.
[0047] In the case of dual access, read operands can also be generated in the Y path, and write operands in the X path.

[0048] Finally, there is an execution EXEC (P6) stage 214 in which the instruction is executed in either the A Unit 110 or the D Unit 112. The result is then stored in a data register or accumulator, or written to memory for Read/Modify/Write or store instructions. Additionally, shift operations are performed on data in accumulators during the EXEC stage.
[0049] The basic principle of operation for a pipeline processor will now be described with reference to Figure 6. As can be seen from Figure 6, for a first instruction 302, the successive pipeline stages take place over time periods T1-T7. Each time period is a clock cycle for the processor machine clock. A second instruction 304 can enter the pipeline in period T2, since the previous instruction has now moved on to the next pipeline stage. For instruction 3, 306, the PRE-FETCH stage 202 occurs in time period T3. As can be seen from Figure 6, for a seven stage pipeline a total of 7 instructions may be processed simultaneously. For all 7 instructions 302-314, Figure 6 shows them all under process in time period T7. Such a structure adds a form of parallelism to the processing of instructions.
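The timing of Figure 6 follows from a simple rule: instruction k enters PRE-FETCH in period Tk and advances one stage per period. A short Python sketch, assuming no stalls (the function names are illustrative):

```python
STAGE_NAMES = ("PRE-FETCH", "FETCH", "DECODE", "ADDRESS", "ACCESS", "READ", "EXEC")


def stage_of(instr, period):
    """Index (0 = P0) of the stage occupied by instruction `instr` (1-based)
    during period T<period>, or None if it is outside the pipeline.
    Assumes one instruction enters per cycle and no stalls."""
    s = period - instr
    return s if 0 <= s < len(STAGE_NAMES) else None


def in_flight(period):
    """Instructions being processed during period T<period>."""
    return [k for k in range(1, period + 1) if stage_of(k, period) is not None]


print(in_flight(7))                 # [1, 2, 3, 4, 5, 6, 7]
print(STAGE_NAMES[stage_of(3, 3)])  # PRE-FETCH
```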
[0050] As shown in Figure 7, the present embodiment of the invention includes a memory management unit 104 which is coupled to external memory units via a 24-bit address bus 114 and a bi-directional 16-bit data bus 116. Additionally, the memory management unit 104 is coupled to program storage memory (not shown) via a 24-bit address bus 118 and a 32-bit bi-directional data bus 120. The memory management unit 104 is also coupled to the I Unit 106 of the machine processor core 102 via a 32-bit program read bus (PB) 122. The P Unit 108, A Unit 110 and D Unit 112 are coupled to the memory management unit 104 via data read and data write busses and corresponding address busses. The P Unit 108 is further coupled to a program address bus 128.

[0051] More particularly, the P Unit 108 is coupled to the memory management unit 104 by a 24-bit program address bus 128, the two 16-bit data write busses (EB, FB) 130, 132, and the two 16-bit data read busses (CB, DB) 134, 136. The A Unit 110 is coupled to the memory management unit 104 via two 24-bit data write address busses (EAB, FAB) 160, 162, the two 16-bit data write busses (EB, FB) 130, 132, the three data read address busses (BAB, CAB, DAB) 164, 166, 168 and the two 16-bit data read busses (CB, DB) 134, 136. The D Unit 112 is coupled to the memory management unit 104 via the two data write busses (EB, FB) 130, 132 and three data read busses (BB, CB, DB) 144, 134, 136.
[0052] Figure 7 represents the passing of instructions from the I Unit 106 to the P Unit 108 at 124, for forwarding branch instructions for example. Additionally, Figure 7 represents the passing of data from the I Unit 106 to the A Unit 110 and the D Unit 112 at 126 and 128 respectively.

[0053] A difficulty with the operation of a pipeline is that different instructions may need to access one and the same resource. Quite often, a first instruction will be operable to modify a resource, for example a register or a part, for example a field, of a register, and a second instruction may then need to access that resource. If the instructions were being processed separately, with the processing of the second instruction only being commenced when the processing of the first instruction has finished, this would not create a conflict. However, in a pipelined architecture, there is a possibility that a second instruction could access the resource before the first instruction has finished with it, unless measures are undertaken to prevent this.
[0054] Such potential conflicts are often termed "data hazards". Examples of possible data hazards include:

- Read after Write (e.g. ARx = ARy followed by *ARx = k16);
- Write after Read (e.g. ARx = ARy followed by mar(ARy=P16));
- Write after Write (e.g. ARx = ARy followed by mar(ARx=P16)).



[0055] Figure 8A represents an example of a pipeline protection action for a Read After Write (RAW). Step 600 represents a write performed by a first instruction in an execute phase (EXE) on a register AR1 (e.g. AR1 = AR0 + K16). Step 602 represents a read performed in the address phase (ADR) on AR1 (e.g. AC0 = *AR1). A pipeline protection action (604) comprises setting a stall 606 for the address phase, whereby the addresses for the read are not generated at 610 (the read of AR1 is not valid) until after the write to AR1 is performed at 612, the new AR1 value being available and the stall for the address phase being relaxed (removed) at 614.
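The length of such a stall can be estimated with a simple cycle count. The sketch below assumes that an instruction reaches stage Pn n cycles after it enters the pipeline, that no forwarding path exists, and that the younger access must occur strictly after the older one; under those assumptions, a reader one cycle behind the writer loses three cycles in the Figure 8A case. The function name and parameters are hypothetical.

```python
# Stage numbers of the access phases (P3-P6 of the described pipeline).
STAGE = {"ADR": 3, "ACC": 4, "RD": 5, "EXE": 6}


def stall_cycles(older_access, younger_access, gap=1):
    """Cycles the younger instruction must be held so that its access in
    phase `younger_access` happens after the older instruction's access in
    phase `older_access`. `gap` is how many cycles after the older
    instruction the younger one entered the pipeline (no forwarding)."""
    earliest = STAGE[older_access] + 1     # first cycle after the older access
    natural = gap + STAGE[younger_access]  # cycle of the younger access if never stalled
    return max(0, earliest - natural)


print(stall_cycles("EXE", "ADR"))  # 3  (EXE write, then ADR access)
print(stall_cycles("EXE", "RD"))   # 1  (EXE write, then RD read)
print(stall_cycles("RD", "ADR"))   # 2  (RD read, then ADR write)
```

The same count covers the other figures under these assumptions, since only the phases of the two accesses change.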

[0056] Figure 8B represents another example of a pipeline protection action for a Read After Write (RAW). Step 620 represents a write performed by a first instruction in an execute phase (EXE) on a register AR0 (e.g. AR0 = AC0 + K16). Step 622 represents a read performed in the read phase (RD) on AR0 (e.g. Condition Read/MMR read). A pipeline protection action (624) comprises setting a stall 626 for the access phase (ACC), whereby the addresses and requests are kept active at 628, a write on AR0 is performed at 630 and the stall of the access phase is relaxed (removed) and the Condition/MMR new value is available at 632.
[0057] Figure 9 represents an example of a pipeline protection action for a Write After Write (WAW). Step 640 represents a write performed by a first instruction in an execute phase (EXE) on a register AR1 (e.g. AR1 = AR0 + K16). Step 642 represents a write performed in the address phase (ADR) on AR1 (e.g. AC0 = *AR1+). A pipeline protection action (644) comprises setting a stall 646 for the address phase, whereby the addresses for the second write to AR1 are not generated at 650 (the write to AR1 is not allowed) until after the first write to AR1 is performed at 652, the new AR1 value being available and the stall for the address phase being relaxed (removed) at 654.
[0058] Figure 10A represents an example of a pipeline protection action for a Write After Read (WAR). Step 660 represents a read performed by a first instruction in a read phase (RD) on a register AR3 (e.g. AC2 = AR3 & K8). Step 662 represents a write performed in the address phase (ADR) on AR3 (e.g. *AR3+DR0). A pipeline protection action (664) comprises setting a stall 666 for the address phase, whereby the addresses for the write to AR3 are not generated at 670 (the write to AR3 is not allowed) until after the read of AR3 is performed at 672, the AR3 write being allowed and the stall for the address phase being relaxed (removed) at 674.

[0059] Figure 10B represents another example of a pipeline protection action for a Write After Read (WAR). Step 680 represents a read performed by a first instruction in a read phase (RD) on a register AR3 (e.g. Condition or MMR). Step 682 represents a write performed by a second instruction in the address phase (ADR) on AR3 (e.g. *AR3+DR0). A pipeline protection action (684) comprises setting a stall 686 for the address phase, whereby the addresses for the write to AR3 are not generated at 690 (the write to AR3 is not allowed) until after the read of AR3 is performed at 690. The write to AR3 is then allowed, and the stall for the address phase is relaxed (removed).
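Figures 9, 10A and 10B differ only in the kinds of access involved. One way to express this is to name the hazard from the access kinds and then apply a single timing test. This is a minimal sketch assuming one cycle per stage, in-order issue, and a younger access that must land in a strictly later cycle than the older one. `Access`, `hazard_kind` and `needs_stall` are hypothetical names. The stage numbers in the usage lines are illustrative and do not give the actual positions of the RD and ADR phases.

```python
from enum import Enum
from typing import Optional


class Access(Enum):
    READ = "read"
    WRITE = "write"


def hazard_kind(older: Access, younger: Access) -> Optional[str]:
    """Name the hazard between two accesses to the same resource, in program order."""
    if older is Access.WRITE and younger is Access.READ:
        return "RAW"  # the younger must see the value the older one writes
    if older is Access.READ and younger is Access.WRITE:
        return "WAR"  # the younger must not overwrite before the older read
    if older is Access.WRITE and younger is Access.WRITE:
        return "WAW"  # the writes must land in program order
    return None       # read after read needs no protection


def needs_stall(older: Access, older_stage: int,
                younger: Access, younger_stage: int, distance: int = 1) -> bool:
    """True if the younger access would not land strictly after the older one."""
    if hazard_kind(older, younger) is None:
        return False
    return distance + younger_stage <= older_stage


# Figure 10A pattern with illustrative stage numbers: read at stage 5, write at stage 3.
print(hazard_kind(Access.READ, Access.WRITE))        # WAR
print(needs_stall(Access.READ, 5, Access.WRITE, 3))  # True
print(needs_stall(Access.READ, 5, Access.READ, 3))   # False
```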
[0060] Figure 11 is a schematic of an approach adopted for defining a generic arbitration function covering all possible resource access conflicts of the pipeline. The generic arbitration function is an abstract concept which underpins and simplifies the design of each arbitration logic, since every arbitration logic can be implemented as a special form of the generic function. Logic in which the generic arbitration function is embedded may also be included in the processing engine. The generic arbitration function can also aid circuit testing at the end of the circuit design. Although described in the context of the present embodiment, this approach could be used for other processor architectures.
[0061] Referring to Figure 11, in an initial stage the organization (700) of the processing engine as a whole is divided into groups of registers, or register files (e.g. 702, 704, 706). In the present example, three register files exist: for the program unit, or control flow (CF); for the data unit (DU); and for the address unit (AU). Each of the register files comprises a number of registers N(i) (e.g. 708, 710, 712). These registers can form the resources to be protected. As well as, or instead of, protecting whole registers, it may be desired to protect parts of (or fields within) registers (e.g. 714, 716, 718). Figure 11 represents this definition of resource granularity. Accordingly, a protected resource could, for example, be a register or a field within a register.
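One way to model this granularity is to describe each protected resource by its register file, its register and a bit mask. Two accesses can then only conflict if they name the same register and their masks share a bit. This is a sketch, not the embodiment's encoding. The class name, the register name `R1` and the 16-bit field layout below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A protected resource: a whole register or a bit field within one."""
    register_file: str  # "CF", "DU" or "AU" in the present example
    register: str       # hypothetical register name
    mask: int           # bits of the register covered by this resource

    def overlaps(self, other: "Resource") -> bool:
        # Accesses can only conflict if they touch at least one common bit
        # of the same register in the same register file.
        return (self.register_file == other.register_file
                and self.register == other.register
                and (self.mask & other.mask) != 0)


WHOLE = Resource("AU", "R1", 0xFFFF)       # whole (hypothetical) 16-bit register
LOW_FIELD = Resource("AU", "R1", 0x00FF)   # hypothetical low field
HIGH_FIELD = Resource("AU", "R1", 0xFF00)  # hypothetical high field

print(WHOLE.overlaps(LOW_FIELD))       # True
print(LOW_FIELD.overlaps(HIGH_FIELD))  # False: disjoint fields are independent
```

Under this model, protecting fields separately would let two instructions that touch disjoint fields of one register proceed without a stall. The cost is one arbitration logic per field rather than per register.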
[0062] For each protected resource, an analysis of the worst possible resource usage is derived. Different instructions will, for example, provide different ways of reading from and writing to a resource. As shown in Figure 11 in respect of register field 714, the pipeline stages represented at 720 in which read/write operations could be performed for that resource are stages P3, P5 and P6. In other words, the worst-case resource usage for this resource is in respect of pipeline stages P3, P5 and P6. These accesses can be classified in terms of their execution stages.

[0063] The pipeline depth (here, e.g., pipeline stages P2, P3, P4, P5 and P6) has to be taken into account to consider the instruction execution overlap, as represented at 722 for five instructions I0, I1, I2, I3 and I4.
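The overlap represented at 722 can be tabulated cycle by cycle. The sketch below assumes that one instruction enters P2 per cycle and that nothing stalls. `occupancy` is a hypothetical helper.

```python
from typing import Dict, List


def occupancy(num_instructions: int, stages: List[str]) -> List[Dict[str, str]]:
    """For each cycle, map every occupied stage to the instruction in it.

    Assumes one instruction enters the first stage per cycle and no stalls.
    """
    depth = len(stages)
    table = []
    for cycle in range(num_instructions + depth - 1):
        row = {}
        for i in range(num_instructions):
            s = cycle - i  # index of the stage instruction Ii occupies this cycle
            if 0 <= s < depth:
                row[stages[s]] = f"I{i}"
        table.append(row)
    return table


STAGES = ["P2", "P3", "P4", "P5", "P6"]
for cycle, row in enumerate(occupancy(5, STAGES)):
    print(cycle, " ".join(f"{st}:{row.get(st, '--')}" for st in STAGES))
```

In cycle 4, all five instructions are in flight at once, from I0 in P6 to I4 in P2. This is the worst-case overlap that the hazard analysis has to cover.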

[0064] All potential data hazards are considered, as represented by the arrows in Figure 12, which is a schematic diagram illustrating potential conflicts at different stages of a pipeline. Figure 12 illustrates the five stages P2 to P6 of Figure 5 for each of the five instructions I0, I1, I2, I3 and I4.
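The hazard arrows of Figure 12 can be enumerated mechanically. The sketch below is a simplification that assumes one cycle per stage and one instruction issued per cycle. It lists every (distance, older stage, younger stage) combination in which a younger instruction would reach the resource no later than an older one, using the worst-case access stages P3, P5 and P6 of register field 714. Each triple would still have to be filtered by access kind, since read after read is harmless. `potential_hazards` is a hypothetical name.

```python
from itertools import product
from typing import Iterable, List, Tuple


def potential_hazards(access_stages: Iterable[int],
                      max_distance: int) -> List[Tuple[int, int, int]]:
    """(distance, older_stage, younger_stage) triples in which a younger
    instruction would touch the resource no later than an older one.

    Assumes one cycle per stage and in-order issue, one instruction per cycle.
    """
    stages = sorted(set(access_stages))
    hazards = []
    for d in range(1, max_distance + 1):
        for older, younger in product(stages, stages):
            if d + younger <= older:
                hazards.append((d, older, younger))
    return hazards


# Worst-case usage of register field 714: accesses at P3, P5 and P6.
for h in potential_hazards([3, 5, 6], max_distance=4):
    print(h)
```

No triple appears beyond a distance of 3, which is the spread between the earliest and latest access stage. This is why only a bounded window of the pipeline depth has to be examined.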

[0065] From the consideration of the data hazards found, a generic arbitration function can be derived as illustrated in Figure 13, this generic arbitration function defining relationships between current and pending accesses. The generic arbitration function can then be used to control the selective stalling of the pipeline to avoid data hazards. The generic logic is representative of all potential resource access conflicts for the pipeline. From an analysis of the potential conflict problems represented schematically in Figure 12, the following signals that can give rise to a conflict can be identified, namely:

a: current read stage 4
b: pending (stage 5) read stage 7
c: pending (stage 5) read stage 6
d: current read stage 6
e: pending (stage 6) read stage 7
f: current read stage 7
1: current write stage 4
2: pending (stage 5) write stage 7
3: pending (stage 5) write stage 6
4: pending (stage 6) write stage 7
5: current write stage 6
6: current write stage 7

[0066] The logic for interpreting these signals is illustrated in Figure 13. It will be noted that signals "d" and "f" are not shown in Figure 13. These signals are not, however, needed, as all potential conflicts involving a stage-7 read are resolved in advance using the pending signals "b" and "e". Accordingly, these signals do not need to be represented in Figure 13, as any conflicts related to them can already be predicted.
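The redundancy of "f" follows from how a pending signal evolves. An instruction that is in stage 5 and will read in stage 7 (signal "b") is, one cycle later, in stage 6 with the same read still ahead (signal "e"). One cycle after that, the read is current (signal "f"). The following minimal model assumes that no stall intervenes and uses the hypothetical names `AccessSignal` and `history`.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class AccessSignal:
    kind: str          # "read" or "write"
    at_stage: int      # stage the instruction occupies now
    access_stage: int  # stage at which it will touch the resource

    def __post_init__(self) -> None:
        if self.at_stage > self.access_stage:
            raise ValueError("the access has already happened")

    @property
    def is_current(self) -> bool:
        return self.at_stage == self.access_stage

    def advance(self) -> "AccessSignal":
        # With no stall, the instruction is one stage further on next cycle.
        return replace(self, at_stage=self.at_stage + 1)


def history(sig: AccessSignal) -> List[AccessSignal]:
    """Follow one access cycle by cycle until it becomes current."""
    trace = [sig]
    while not trace[-1].is_current:
        trace.append(trace[-1].advance())
    return trace


for s in history(AccessSignal("read", 5, 7)):  # starts as signal "b"
    label = "current" if s.is_current else f"pending (stage {s.at_stage})"
    print(f"{label} {s.kind} stage {s.access_stage}")
```

The three printed lines correspond to signals "b", "e" and "f" in turn. Any conflict that "f" could reveal has therefore already been visible for two cycles.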
[0067] It will thus be appreciated that in general the generic function will have a large number of variable operands and that each arbitration logic will be a special form or subset of the generic function with a lower degree of degeneracy, i.e. with a number of the operands which are variable in the generic function being fixed.

[0068] Once determined, the generic arbitration function can be used to implement the circuit design of each of the arbitration logic blocks, which are all definable as special forms of the generic arbitration function. The full generic form is not needed for the arbitration logic of each individual resource since, for any given resource, some of the conflicts envisaged by the generic arbitration function will in general be impossible.
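The relationship between the generic function and each per-resource arbitration logic can be sketched as partial evaluation of a sum of products. Signals that can never be asserted for a given resource are fixed at 0, and every product term containing one of them drops out. The term list below is purely hypothetical and does not reproduce the logic of Figure 13. Only the specialization step is the point.

```python
from typing import Dict, FrozenSet, List, Tuple

Term = Tuple[str, str]

# Hypothetical conflict terms: if both signals of a pair are asserted, stall.
# These placeholders do NOT reproduce the terms of Figure 13.
GENERIC_TERMS: List[Term] = [("p", "q"), ("p", "r"), ("s", "t"), ("q", "u")]


def arbitrate(signals: Dict[str, bool], terms: List[Term]) -> bool:
    """OR of ANDs: request a stall if any term has both of its signals asserted."""
    return any(signals.get(x, False) and signals.get(y, False) for x, y in terms)


def specialize(possible: FrozenSet[str], terms: List[Term]) -> List[Term]:
    """Fix impossible signals to 0 and drop the terms that become constant 0."""
    return [(x, y) for x, y in terms if x in possible and y in possible]


# A resource for which only signals p, q and u can ever be asserted.
reduced = specialize(frozenset({"p", "q", "u"}), GENERIC_TERMS)
print(reduced)                                     # [('p', 'q'), ('q', 'u')]
print(arbitrate({"p": True, "q": True}, reduced))  # True
print(arbitrate({"p": True, "u": True}, reduced))  # False
```

For any input in which the impossible signals are 0, `arbitrate` gives the same answer with `reduced` as with `GENERIC_TERMS`. This equivalence is what allows each arbitration logic block to be a smaller special form of the generic function.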

[0069] Generic arbitration logic embodying the generic arbitration function need only be provided in the processing engine if full protection is desired against simultaneous occurrence of all envisaged conflicts.
[0070] The concept of the generic arbitration function can be further exploited at the stage of software testing of the hardware design of the processing engine. In general, generating all the test patterns for pipelined processing engine hardware can be a huge undertaking because of the complexity of the CPU, its instruction set and its architecture. The test patterns need to be defined in terms of a prespecified reference, and it is the specification of this reference which can be highly laborious. With the present design, a functional test pattern generator can be created using the generic function as the reference, in association with a conventional instruction set latency table. This simplifies the creation of the test pattern generator, since the scope of the testing can be restricted to the possible conflicts envisaged by the generic function. Because the test pattern generator follows directly from the generic function, the process of hardware design testing is not only quicker but also more systematic, and it ensures good coverage.
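A functional test pattern generator of this kind could, as a rough sketch, walk every ordered pair of instructions in a latency table at every issue distance up to the pipeline depth. It would keep those sequences that the conflict function says must stall. Each kept sequence is then a test case whose expected stall behaviour is known in advance. The instruction names and latency figures below are hypothetical. `conflicts` is a simple stand-in, not the generic function of Figure 13: it flags any non read/read pair whose younger access would not land strictly after the older one.

```python
from itertools import product
from typing import Dict, List, Tuple

Accesses = Dict[str, int]  # access kind -> stage at which it occurs

# Hypothetical latency table for one register under test.
LATENCY: Dict[str, Accesses] = {
    "LOAD_ADDR": {"write": 3},
    "ALU_OP": {"read": 5, "write": 6},
    "MOVE_OUT": {"read": 5},
}


def conflicts(older: Accesses, younger: Accesses, distance: int) -> bool:
    """Stand-in conflict function: any pair other than read/read in which the
    younger access would not land strictly after the older one."""
    for old_kind, old_stage in older.items():
        for new_kind, new_stage in younger.items():
            if old_kind == new_kind == "read":
                continue
            if distance + new_stage <= old_stage:
                return True
    return False


def generate_patterns(table: Dict[str, Accesses],
                      max_distance: int) -> List[Tuple[str, str, int]]:
    """Every (older, younger, distance) sequence expected to cause a stall."""
    patterns = []
    for (a, acc_a), (b, acc_b) in product(table.items(), repeat=2):
        for d in range(1, max_distance + 1):
            if conflicts(acc_a, acc_b, d):
                patterns.append((a, b, d))
    return patterns


for pattern in generate_patterns(LATENCY, max_distance=3):
    print(pattern)
```

Pairs that never conflict, such as two MOVE_OUT reads, produce no test case. This is how restricting the scope to the envisaged conflicts keeps the pattern set small.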

[0071] Figure 14 is a schematic overview of an interlocked architecture for a processing engine. As shown in Figure 14, there are first and second pipelines 820 and 850, receiving instructions from a control flow 800. In terms of Figure 2, the first pipeline could be the D unit and the second pipeline could be the A unit, for example.

[0072] The control flow includes an instruction buffer 810 and first and second decoders 812 and 814 for decoding first and second instruction streams. A parallel encoding validity check is effected in parallel verification logic 816 to ensure that the parallel context is valid. The instructions from the decoders 812 and 814 are dispatched from dispatch logic 818 under the control of a dispatch controller 808.
[0073] In the first pipeline 820, successive pipeline stages 822, 824, 826 and 828 are under the control of a local pipeline controller 830. Associated with the first pipeline 820 is first local interlock control logic 838 forming a first local interlock controller. The pipeline controller is responsive to control signals from the associated interlock control logic to cause selective stalling of the pipeline stages. The interlock control logic is responsive to outputs from the pipeline 820 and also to outputs from a register file 832 for the pipeline 820. The register file 832 includes register file control logic 834 and individual registers 836. One or more operators 840 and 842 may be accessed in respect of a current access operation.

[0074] In the second pipeline 850, successive pipeline stages 852, 854, 856 and 858 are under the control of a local pipeline controller 860. Associated with the second pipeline 850 is second local interlock control logic 868 forming a second local interlock controller. The pipeline controller is responsive to control signals from the associated interlock control logic to cause selective stalling of the pipeline stages. The interlock control logic is responsive to outputs from the pipeline 850 and also to outputs from a register file 862 for the pipeline 850. The register file 862 includes register file control logic 864 and individual registers 866. One or more operators 870 and 872 may be accessed in respect of a current access operation.

[0075] It will be noted that each of the local pipeline controllers 830 and 860 is responsive to outputs from each of the local interlock controllers 838 and 868. This general principle is extendible: where more than two pipelines are provided, the local pipeline controller for each pipeline will be responsive to the outputs from all of the local interlock controllers.
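The fan-in described here can be sketched as an element-wise OR of every interlock controller's per-stage stall requests, with the combined result delivered to every local pipeline controller. The sketch assumes that each interlock controller reports one request per stage. The function names and the three-stage request vectors are hypothetical.

```python
from typing import Dict, List


def combine_stall_requests(requests: Dict[str, List[bool]]) -> List[bool]:
    """OR the per-stage stall requests of every local interlock controller."""
    vectors = list(requests.values())
    if not vectors:
        raise ValueError("at least one interlock controller is required")
    width = len(vectors[0])
    if any(len(v) != width for v in vectors):
        raise ValueError("all controllers must report the same stages")
    return [any(v[i] for v in vectors) for i in range(width)]


def controller_inputs(requests: Dict[str, List[bool]],
                      pipelines: List[str]) -> Dict[str, List[bool]]:
    """Every local pipeline controller sees the requests of all interlock controllers."""
    combined = combine_stall_requests(requests)
    return {p: list(combined) for p in pipelines}


# Per-stage requests (hypothetically for stages 3, 4 and 5) from controllers 838 and 868.
stall_requests = {"838": [False, True, False], "868": [False, False, True]}
print(controller_inputs(stall_requests, ["830", "860"]))
```

Adding a third pipeline only adds one entry to each argument, which reflects the extendibility noted above.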
[0076] Thus, in Figure 14, the natural partitioning of the interlock control is the same as for the register files. However, this need not be the case, and it may be desirable to move an individual interlock control (e.g. 838 or 868) from its original register file to another, depending on where the information used by the arbitration function (pending versus current accesses) is located.

[0077] As mentioned above, in the present embodiment there are three register files, namely for the control flow (CF), for the D unit (DU) and for the A unit (AU). Accordingly, three sets of local interlock control logic are provided. The physical location of the control logic is, however, distributed so that each set sits where the pending and/or current access information is mainly located (AU, CF). For the D unit, the interlock logic is moved to the control flow unit, where the largest percentage of the signals needed for control is pending in the instruction pipeline. By re-using current accesses of the register files as much as possible, the logic overhead can be minimized. The stalls which are generated are spread among all the CPU sub-units having a pipeline and their associated local pipeline control logic.
[0078] A schematic overview of an exemplary structure for an interlock control mechanism is illustrated in Figure 15, for example for the pipeline 820 of Figure 14. It will be understood that the mechanism could have the same structure for other pipelines, such as the pipeline 850 of Figure 14. It will be noted that no memory elements (read/write queue) are provided for stall management, as the instruction pipeline itself is used to achieve this. For example, a write after write conflict from stage P4 to stage P7 of the pipeline between two consecutive instructions should generate a 3-cycle stall (at stage 4). In practice, the interlock logic generates three consecutive 1-cycle stalls (at stage 4).
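Because nothing is queued, the interlock simply re-evaluates the conflict every cycle and asserts a 1-cycle stall each time the conflict still holds. The sketch below replays the P4/P7 example under three assumptions: one cycle elapses per stage, the stalled younger instruction waits at its write stage while the older one keeps advancing, and the younger write may not share a cycle with the older write. `simulate_waw` is a hypothetical name.

```python
from typing import List


def simulate_waw(older_write_stage: int, younger_write_stage: int,
                 distance: int = 1) -> List[int]:
    """Cycles in which a 1-cycle stall is asserted at the younger write stage.

    Cycle 0 is the cycle in which the younger instruction reaches its write
    stage. The check is purely combinational and is repeated every cycle.
    """
    older_stage = younger_write_stage + distance  # where the older one is now
    stalls = []
    cycle = 0
    # Conflict while the older write is still pending or happens this cycle.
    while older_stage <= older_write_stage:
        stalls.append(cycle)  # hold the younger instruction for one more cycle
        older_stage += 1      # the older instruction moves on regardless
        cycle += 1
    return stalls


print(simulate_waw(7, 4))  # [0, 1, 2]: three consecutive 1-cycle stalls at stage 4
```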
[0079] Figure 15 illustrates a regular and parallel structure for hazard detection, including:

- A first level comprises basic decoders 882 (from pending signals or current accesses). These decoders are the same as the decoders 880 in the register files but are applied to the pending signals. The decoder logic is responsive to access information from at least selected pipeline stages to derive access information for respective protected resources. The decoders 882 are operable to decode pending access information; the decoders 880 are operable to decode current accesses.

- A second level comprises a stage 884 of merging of the equivalent signals (in the arbitration function sense) for each register to protect. This is achieved by ORing those signals in OR gates, for example using logic as illustrated in Figure 13. The outputs of the decoders 880 for current accesses are merged in merge logic 883 and are then supplied to merge logic 884, where they are merged with the outputs of the decoders 882 for pending accesses.

- A third level is composed of as many sets of arbitration logic 886 as there are registers to protect. The arbitration logic is extracted from the generic arbitration function illustrated in Figure 13 according to its inputs (i.e. it forms a subset of the arbitration logic of Figure 13), and is applied (reduced) to each register access trace. The register access traces are formed from incoming signals specifying an access/phase.

- A fourth level is simply the merge 888 of all the arbitration results, for example using OR gates. Each set of arbitration logic generates between 1 and 3 stalls (at stages 3, 4 and 5). All the stalls of the same stage are merged together. The merged output signals are supplied as stall control signals 889 to the associated pipeline control logic for controlling selective stalling of the pipeline.
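The four levels can be mirrored in software as decode, merge, per-register arbitration and a final per-stage OR. This sketch is illustrative only. Accesses are given as hypothetical (register, signal) pairs, and the single arbitration rule shared by every register is a placeholder, not the logic of Figure 13.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Access = Tuple[str, str]  # (register, signal name), e.g. ("R1", "p")

# Placeholder rule: if both signals of a pair are present for one register,
# request a stall at the given stage. NOT the terms of Figure 13.
RULE: Dict[Tuple[str, str], int] = {("p", "q"): 4, ("r", "s"): 3, ("q", "t"): 5}


def decode(accesses: Iterable[Access]) -> Dict[str, Set[str]]:
    """Level 1: per-register view of the access signals."""
    per_reg: Dict[str, Set[str]] = defaultdict(set)
    for reg, sig in accesses:
        per_reg[reg].add(sig)
    return per_reg


def merge(*decoded: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Level 2: OR together the signals decoded from each source."""
    merged: Dict[str, Set[str]] = defaultdict(set)
    for per_reg in decoded:
        for reg, sigs in per_reg.items():
            merged[reg] |= sigs
    return merged


def arbitrate(signals: Set[str]) -> Set[int]:
    """Level 3: one instance per protected register; returns stall stages."""
    return {stage for (x, y), stage in RULE.items() if x in signals and y in signals}


def stall_vector(pending: Iterable[Access], current: Iterable[Access],
                 stages: Tuple[int, ...] = (3, 4, 5)) -> Dict[int, bool]:
    """Level 4: OR the requests of all registers, stage by stage."""
    merged = merge(decode(pending), decode(current))
    requested: Set[int] = set()
    for sigs in merged.values():
        requested |= arbitrate(sigs)
    return {s: s in requested for s in stages}


pending = [("R1", "p"), ("R2", "r")]
current = [("R1", "q"), ("R3", "s")]
print(stall_vector(pending, current))  # {3: False, 4: True, 5: False}
```

Signals "r" and "s" do not combine here because they belong to different registers. This shows why the arbitration runs separately for each protected resource.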

[0080] The stall control signals 889 are also supplied to register access control logic 890 for current access control. Stall penalty reduction is not considered in this architecture, with the result that any conflict will result in an appropriate pipeline stall, that is, a freeze of the lower stages and bubble insertion at the next stage.
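The effect of such a stall on the pipeline registers can be sketched under the following assumptions. Stage 0 is the earliest stage. A stall at stage k holds stages 0 to k, inserts a bubble (`None`) into stage k+1, and lets later stages drain normally. `step` is a hypothetical helper, and any instruction offered while the pipeline is stalled is simply not accepted.

```python
from typing import List, Optional

Slot = Optional[str]  # instruction name, or None for a bubble


def step(pipeline: List[Slot], incoming: Slot, stall_stage: Optional[int]) -> List[Slot]:
    """Advance the pipeline by one cycle (index 0 is the earliest stage)."""
    n = len(pipeline)
    if stall_stage is None:
        return [incoming] + pipeline[:-1]             # normal advance
    nxt = list(pipeline[: stall_stage + 1])           # freeze stages 0..k
    if stall_stage + 1 < n:
        nxt.append(None)                              # bubble into stage k+1
        nxt.extend(pipeline[stall_stage + 1 : n - 1])  # later stages move on
    return nxt


p = ["I3", "I2", "I1", "I0"]
print(step(p, "I4", None))  # ['I4', 'I3', 'I2', 'I1']
print(step(p, "I4", 1))     # ['I3', 'I2', None, 'I1']
```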
[0081] The arbitration logic is relatively simple in hardware as a result of its 'logic re-use'. By logic re-use it is meant that the arbitration logic makes use of tapping from the queue of the existing main pipeline 822-828 (rather than creating a new queue for arbitration purposes, as has previously been proposed) and also makes use of the results from the decoders 880, in the embodiment through the merge logic 883. Consequently, the additional amount of hardware required for the arbitration logic blocks is greatly reduced. In a specific hardware implementation of the embodiment as a DSP integrated circuit, all the arbitration logic covers less than 2% of the total CPU area. By contrast, it is estimated that without logic re-use the chip area required for the logic necessary to provide a comparable level of pipeline protection would be at least several times greater, perhaps an order of magnitude greater.
[0082] There has been described a pipeline protection mechanism which, as a result of its regularity and generality, is straightforward to implement and to test. Queuing of read/write pendings (pending operations) is handled by the pipeline itself. Thus the interlock detection logic is purely combinatorial and does not require a read/write queue as part of the interlock mechanism.
[0083] Figure 16 is a schematic representation of an integrated circuit 40 incorporating the processor 10 of Figure 1. The integrated circuit can be implemented using application specific integrated circuit (ASIC) technology. As shown, the integrated circuit includes a plurality of contacts 42 for surface mounting. However, the integrated circuit could include other configurations, for example a plurality of pins on a lower surface of the circuit for mounting in a zero insertion force socket, or indeed any other suitable configuration.

[0084] One application for a processing engine such as the processor 10, for example as incorporated in an integrated circuit as in Figure 16, is in a telecommunications device, for example a mobile wireless telecommunications device. Figure 17 illustrates one example of such a telecommunications device. In the specific example illustrated in Figure 17, the telecommunications device is a mobile telephone 11 with an integrated user input device such as a keypad, or keyboard 12, and a display 14. The display could be implemented using appropriate technology, for example a liquid crystal display or a TFT display. The processor 10 is connected to the keypad 12, where appropriate via a keyboard adapter (not shown), to the display 14, where appropriate via a display adapter (not shown), and to a telecommunications interface or transceiver 16, for example a wireless telecommunications interface including radio frequency (RF) circuitry. The radio frequency circuitry could be incorporated into, or separate from, an integrated circuit 40 comprising the processor 10. The RF circuitry 16 is connected to an aerial 18.
[0085] It will be appreciated that although particular embodiments of the invention have been described, many modifications/additions and/or substitutions may be made within the scope of the present invention.
Claims

1. A processing engine including a processor pipeline with a plurality of pipeline stages, a plurality of resources and a pipeline protection mechanism, the pipeline protection mechanism comprising, for each protected resource, an arbitration logic for anticipating access conflicts for that resource between the pipeline stages, an output of each arbitration logic being connected to form stall control signals for selective stalling of stages of the pipeline to avoid resource access conflicts.

2. A processing engine according to claim 1, wherein each arbitration logic is definable as a specific form of a single, generic arbitration function.

3. A processing engine according to claim 2, wherein the generic arbitration function is embedded in generic arbitration logic of the processing engine.

4. A processing engine according to any preceding claim, including pipeline control logic for controlling the stages of the pipeline, the pipeline control logic being connected to receive the stall control signals output from the arbitration logic.

5. A processing engine according to any preceding claim, wherein the pipeline protection mechanism comprises output merge logic for merging the output of each arbitration logic to form stall control signals for controlling the selective stalling of the pipeline to avoid resource access conflicts.

6. A processing engine according to any preceding claim, wherein each arbitration logic is connected to receive access information from the pipeline.

7. A processing engine according to claim 6, wherein each arbitration logic is also connected to receive further control signals related to the protected resource associated with the arbitration logic concerned.

8. A processing engine according to any preceding claim and comprising a decoder stage connected to receive access information from the pipeline to derive access information for respective ones of the protected resources.

9. A processing engine according to claim 7 or claim 8, wherein the further control signals are outputs from the decoder stage.

10. A processing engine according to claim 8, wherein the decoder stage includes a plurality of access decoders, each access decoder being associated with a respective pipeline stage, and wherein the pipeline protection mechanism comprises, for at least one protected resource, input merge logic for merging the access information for that resource from the access decoders.

11. A processing engine according to claim 10, wherein the access information relates to pending accesses.

12. A processing engine according to claim 11, further comprising a current access decoder stage connected to receive current access information from the pipeline to derive current access information for respective protected resources, the arbitration logic for a protected resource being further connected to receive current access information for that protected resource.

13. A processing engine according to claim 12, wherein the current access decoder stage is a decoder stage for a register file.

14. A processing engine according to claim 12 or claim 13, wherein the current access information is also supplied to the input merge logic.

15. A processing engine according to any one of claims 10 to 13, comprising, for each protected resource, respective input merge logic for that resource.

16. A processing engine according to any preceding claim, wherein at least one resource is one of: a group of registers; a register; a field of a register; and a sub-field of a register.

17. A processing engine according to any preceding claim in the form of a digital signal processor.

18. An integrated circuit including a processing engine according to any preceding claim.

19. Telecommunications apparatus comprising a processing engine according to any one of claims 1 to 17.

20. Telecommunications apparatus according to claim 19, comprising a user input device, a display, a wireless telecommunications interface and an aerial.

21. A method of protecting a pipeline in a processing engine, which processing engine includes a processor pipeline with a plurality of pipeline stages and a plurality of resources, the method comprising, for respective protected resources, separately arbitrating for the resource to anticipate access conflicts between the pipeline stages, and selectively stalling stages of the pipeline depending upon the result of the arbitration for the respective resources to avoid resource access conflicts.

22. A method according to claim 21, wherein arbitration logic for each protected resource is derived from a generic arbitration function.

23. A method according to claim 22, wherein the generic arbitration function is representative of all potential resource access conflicts for the pipeline.

24. A method according to any one of claims 21 to 23, wherein access information from at least selected pipeline stages is decoded to derive access information for respective protected resources.

25. A method according to claim 24, wherein, for at least one protected resource, access information for a plurality of pipeline stages is merged for arbitration by arbitration logic for that resource.

26. A method according to claim 25, wherein the access information relates to pending accesses.

27. A method according to claim 26, wherein further access information also relates to current resource accesses.

28. A method according to any one of claims 21 to 27, wherein at least one resource is one of: a group of registers; a register; a field of a register; and a sub-field of a register.

29. A method of software testing a hardware design of a multi-stage pipeline processor having a generic function defining possible conflicts between the pipeline stages of the processor, the method comprising:

applying the generic function together with an instruction set latency table to create a test pattern generator; and

applying the test pattern generator to test the hardware design.